ReneWind

Problem Statement

Renewable energy sources play an increasingly important role in the global energy mix as efforts to reduce the environmental impact of energy production intensify.

Out of all the renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S. Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.

Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable; if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.

The sensors fitted across different machines involved in the process of energy generation collect data related to various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, brake, etc.).

Objective

“ReneWind” is a company working on improving the machinery/processes involved in the production of wind energy using machine learning, and has collected sensor data on generator failures of wind turbines. They have shared a ciphered version of the data, as the data collected through sensors is confidential (the type of data collected varies between companies). The data has 40 predictors, with 40,000 observations in the training set and 10,000 in the test set.

The objective is to build various classification models, tune them and find the best one that will help identify failures so that the generator could be repaired before failing/breaking and the overall maintenance cost of the generators can be brought down.

“1” in the target variable should be considered as “failure” and “0” represents “no failure”.

The nature of predictions made by the classification model will translate as follows:

So, the maintenance cost associated with the model would be:

Maintenance cost = TP*(Repair cost) + FN*(Replacement cost) + FP*(Inspection cost) where,

Since the objective is to reduce the maintenance cost, we want a metric that reflects it directly.

So, we will try to maximize the ratio of the minimum possible maintenance cost to the maintenance cost associated with the model.

The value of this ratio lies between 0 and 1; it equals 1 only when the maintenance cost associated with the model equals the minimum possible maintenance cost.
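This metric can be sketched in code. The individual costs are not given in the problem statement, so the values below are purely illustrative placeholders (replacement > repair > inspection):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative placeholder costs (the true values are not provided):
# replacement > repair > inspection.
REPLACEMENT_COST = 40_000
REPAIR_COST = 15_000
INSPECTION_COST = 5_000

def maintenance_cost(y_true, y_pred):
    """Total maintenance cost implied by a model's predictions."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return tp * REPAIR_COST + fn * REPLACEMENT_COST + fp * INSPECTION_COST

def minimum_vs_model_cost(y_true, y_pred):
    """Ratio of the minimum possible cost to the model's cost.
    The minimum is reached when every actual failure is repaired
    pre-emptively: (TP + FN) * repair cost."""
    minimum = int(np.sum(y_true)) * REPAIR_COST
    return minimum / maintenance_cost(y_true, y_pred)
```

A perfect classifier (TP = all failures, FN = FP = 0) drives the ratio to 1; every missed failure or false alarm pushes it below 1.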

Data Dictionary

Outline

1. Data Overview

Importing necessary libraries and data

Let us take a look at the imported data and the summary of different columns:

Now we check for missing values in the data. The number of missing values in each column of the imported data is shown below:
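Such a check is a one-liner in pandas; a minimal sketch on a stand-in frame (the real data has 40 columns, V1 through V40):

```python
import numpy as np
import pandas as pd

# Small stand-in frame; the real data has columns V1..V40.
df = pd.DataFrame({"V1": [0.3, np.nan, 1.2], "V2": [np.nan, 0.5, 0.7]})

missing = df.isnull().sum()   # missing values per column
print(missing[missing > 0])   # show only columns with missing data
```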

There are 46 and 39 missing values in columns V1 and V2, respectively. We'll explore this further.

2. Exploratory Data Analysis (EDA)

First, we define a few functions for EDA, and then we proceed with our analysis.

Summary of Quantitative Variables

Let us view the statistical summary of the numerical columns in the data.

Boxplot of All Quantitative Columns

We have 40 anonymized variables with no descriptive information about them, so plotting each variable individually may not yield insights. Hence, in the following we present boxplots of all variables together.

Individual Histogram of Features

Correlations

Highly correlated variables are listed below:

Correlation between features does not generally hurt the predictive performance of learning models. However, in our case it is a problem because we also aim to draw inferences from our models. For this reason, we later remove highly correlated features from the data.

Features vs. Target

Features vs. Features

We select a set of highly correlated features to study their relationships with each other.

Observations

3. Data Pre-processing

3.1 Data Preparation for Modeling

For creating the training and validation sets, we do the following:

Splitting data into training and validation sets
Reading test data

The percentages of failure and no-failure cases in the whole data set, the training set, and the test set are almost the same. Hence, the split data sets have a good distribution of generator failure status.

The sets are imbalanced, as generator failures make up only about 5.47% of all cases.

3.3 Missing-Value Treatment

We have missing values in the V1 and V2 columns.

The imputed values may not always be integers, but since all of our features are continuous, we do not need to round them.
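One common approach, assumed here, is median imputation fitted on the training set only, so that the validation and test sets are imputed with training-set statistics (no leakage):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Assumed approach: median imputation, fitted on the training set only.
imputer = SimpleImputer(strategy="median")

# Toy frames standing in for the real V1/V2 columns.
X_train = pd.DataFrame({"V1": [1.0, np.nan, 3.0], "V2": [2.0, 4.0, np.nan]})
X_val = pd.DataFrame({"V1": [np.nan], "V2": [5.0]})

X_train_imp = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
X_val_imp = pd.DataFrame(imputer.transform(X_val), columns=X_val.columns)
```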

3.4 Outlier Detection and Treatment

We use the IQR, which is the interval from the 1st quartile (Q1) to the 3rd quartile (Q3) of the data in question, and flag points for investigation if they fall outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].

Outlier Detection:

Let us look at the boxplot of our quantitative variables.

All quantitative variables have outliers.

Outlier Treatment on training, validation, and test sets:

We treat outliers in each set by flooring and capping based on training-set statistics. In the following, we create a function that calculates the lower and upper whiskers of the training data attributes and uses them to clip features in the training, validation, and test sets. Any such transformation must be based on training-set statistics in order to avoid data leakage.
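A sketch of such a clipping helper (the notebook's own function may differ in name and details):

```python
import pandas as pd

def train_whiskers(train_df):
    """Per-column whiskers Q1 - 1.5*IQR and Q3 + 1.5*IQR, computed
    on the training data only."""
    q1 = train_df.quantile(0.25)
    q3 = train_df.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def clip_outliers(df, lower, upper):
    """Floor/cap each feature to the training-set whiskers (no leakage
    when applied to validation or test data)."""
    return df.clip(lower=lower, upper=upper, axis=1)

# Toy example: the extreme value 100.0 gets capped to the upper whisker.
X_train = pd.DataFrame({"V1": [1.0, 2.0, 3.0, 4.0, 100.0]})
low, high = train_whiskers(X_train)
X_train_clipped = clip_outliers(X_train, low, high)
```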

Now, we apply the frac_outside_IQR function on training, validation, and test data.

The boxplot below shows that there are no more outliers in the quantitative features of the training set.

3.5 Removing Highly Correlated Features

Collinearity occurs when predictor variables in a model are highly correlated. Removing collinear features can help a model generalize and improves its interpretability. In the following, we create a function to remove highly correlated features.
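A sketch of such a helper, using the upper triangle of the absolute correlation matrix so each pair is considered once (the 0.85 threshold is an assumed cutoff, not necessarily the notebook's):

```python
import numpy as np
import pandas as pd

def drop_high_corr(df, threshold=0.85):
    """Drop one feature from each pair whose absolute Pearson
    correlation exceeds `threshold`."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Toy example: V2 is an exact multiple of V1, so one of the pair is dropped.
df = pd.DataFrame({"V1": [1.0, 2.0, 3.0, 4.0],
                   "V2": [2.0, 4.0, 6.0, 8.0],
                   "V3": [4.0, 1.0, 3.0, 2.0]})
reduced, dropped = drop_high_corr(df, threshold=0.85)
```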

4. Model Evaluation Criterion

Three types of cost are associated with this problem:

  1. Replacement cost - False Negatives - Predicting no failure, while there will be a failure
  2. Inspection cost - False Positives - Predicting failure, while there is no failure
  3. Repair cost - True Positives - Predicting failure correctly

How to reduce the overall cost?

Let's create two functions to calculate the different metrics and the confusion matrix, so that we don't have to repeat the same code for each model.

Defining scorer to be used for hyperparameter tuning
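With scikit-learn, the cost ratio described above can be wrapped as a scorer via `make_scorer`; the cost values below are illustrative placeholders, not figures from the data:

```python
from sklearn.metrics import confusion_matrix, make_scorer

def cost_ratio(y_true, y_pred,
               repair=15_000, replacement=40_000, inspection=5_000):
    """Minimum possible maintenance cost over the model's cost.
    The cost values are illustrative placeholders."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    model_cost = tp * repair + fn * replacement + fp * inspection
    minimum_cost = (tp + fn) * repair  # every real failure repaired in time
    return minimum_cost / model_cost

# greater_is_better defaults to True, so hyperparameter searches passed
# scoring=cost_scorer will maximize the ratio (best possible value: 1.0).
cost_scorer = make_scorer(cost_ratio)
```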

5. Model Building with Original Data

5.1 Classifying Models

5.2 Model Performance Evaluation

Cross Validation on Training Data

Model Performance on Training Data

Model Performance on Validation Data

6. Model Building with OverSampled Data

6.1 OverSampled Data
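Oversampling is often done with SMOTE from imblearn, which synthesizes new minority samples. As a dependency-light sketch, simple random oversampling with scikit-learn's `resample` is shown below; unlike SMOTE, this duplicates existing minority rows rather than creating new ones:

```python
import pandas as pd
from sklearn.utils import resample

def oversample_minority(X, y, random_state=1):
    """Randomly duplicate minority-class (failure) rows until both
    classes have the same number of samples."""
    data = pd.concat([X, y.rename("target")], axis=1)
    majority = data[data["target"] == 0]
    minority = data[data["target"] == 1]
    minority_up = resample(minority, replace=True,
                           n_samples=len(majority), random_state=random_state)
    balanced = pd.concat([majority, minority_up])
    return balanced.drop(columns="target"), balanced["target"]

# Toy example: 4 non-failures, 1 failure -> 4 of each after oversampling.
X = pd.DataFrame({"V1": [1.0, 2.0, 3.0, 4.0, 5.0]})
y = pd.Series([0, 0, 0, 0, 1])
X_over, y_over = oversample_minority(X, y)
```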

6.2 Classifying Models

6.3 Model Performance Evaluation

Cross Validation on OverSampled Training Data

Model Performance on Training Data

Model Performance on Validation Data

7. Model Building with UnderSampled Data

7.1 UnderSampled Data
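As with oversampling, here is a dependency-light sketch of random undersampling using scikit-learn's `resample` (imblearn's `RandomUnderSampler` does the same job):

```python
import pandas as pd
from sklearn.utils import resample

def undersample_majority(X, y, random_state=1):
    """Randomly drop majority-class (no-failure) rows until both
    classes have the same number of samples."""
    data = pd.concat([X, y.rename("target")], axis=1)
    majority = data[data["target"] == 0]
    minority = data[data["target"] == 1]
    majority_down = resample(majority, replace=False,
                             n_samples=len(minority), random_state=random_state)
    balanced = pd.concat([majority_down, minority])
    return balanced.drop(columns="target"), balanced["target"]

# Toy example: 4 non-failures, 2 failures -> 2 of each after undersampling.
X = pd.DataFrame({"V1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
y = pd.Series([0, 0, 0, 0, 1, 1])
X_under, y_under = undersample_majority(X, y)
```

Undersampling discards information from the majority class, which is why both strategies are compared before choosing one.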

7.2 Classifying Models

7.3 Model Performance Evaluation

Cross Validation on UnderSampled Training Data

Model Performance on Training Data

Model Performance on Validation Data

8. Model Selection for Tuning

Let us look at the summary of evaluation scores on the models to compare model performances and choose the three best ones.

8.1 Cross Validation Score of all Models

8.2 Performance Summary of all Models on Training Set

8.3 Performance Summary of all Models on Validation Set

8.4 Three Chosen Models

We choose Random Forest_over, Xgboost_over, and dtree_over as the three best-performing models among all the models built previously, and will tune them further to improve performance. The reasons for these choices are:

9. Hyperparameter Tuning

We aim to tune Random Forest_over, Xgboost_over, and dtree_over classifying models.

9.1 Important Features
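For tree-based models, impurity-based importances are exposed through `feature_importances_`. A minimal sketch on synthetic data (the feature names and data are stand-ins):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: only V1 actually drives the target.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["V1", "V2", "V3"])
y = (X["V1"] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```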

9.2 Tuning Random Forest on OverSampled Data

$\;\;\;\;$ RandomizedSearchCV
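A sketch of randomized search for the random forest; the parameter grid, the synthetic data, and the `recall` scorer below are assumptions for illustration (the notebook scores with its custom cost-ratio scorer instead):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical search space; the notebook's actual grid may differ.
param_dist = {
    "n_estimators": [100, 150, 200, 250],
    "max_features": ["sqrt", 0.5],
    "max_samples": [0.5, 0.7, 0.9],
    "min_samples_leaf": [1, 2, 4],
}

# Small imbalanced stand-in data set.
X, y = make_classification(n_samples=200, n_features=10,
                           weights=[0.9], random_state=1)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=1),
    param_distributions=param_dist,
    n_iter=5, cv=3, scoring="recall",  # notebook: custom cost scorer
    random_state=1, n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```

Randomized search samples `n_iter` combinations from the grid, which is much cheaper than exhaustive `GridSearchCV` when the space is large.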

9.3 Tuning Xgboost on OverSampled Data

$\;\;\;\;$ RandomizedSearchCV

9.4 Tuning Decision Tree on OverSampled Data

$\;\;\;\;$ GridSearchCV

9.5 Comparison of Tuned Models

Model Performance on Training Data
Model Performance on Validation Data

10. Comparing all Models

10.1 Summary of Performances on Training Data

10.2 Summary of Performances on Validation Data

10.3 Models with Highest "Minimum_Vs_Model_cost"

Among all models, the Minimum_Vs_Model_cost of the following models is greater than 0.78 on the validation set.

Let us take a look at the training performance of these models.

11. The Final Model

11.1 Selecting the Final Model

The best model is the tuned random forest classifier with parameters (random_state=1, n_estimators=250, min_samples_leaf=1, max_samples=0.5, max_features='sqrt'), trained on our oversampled data.

11.2 Performance of the Final Model on Test Data

12. Pipelines for Productionizing the Final Model

Now that we have a final model, let's use pipelines to put it into production. We will create a pipeline to:

Pipeline
Prediction
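A minimal sketch of such a pipeline, combining median imputation (an assumed preprocessing step) with the final tuned random forest from Section 11.1. Note that oversampling is a training-time operation and does not belong in the inference pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Impute missing sensor values, then predict with the final tuned model.
model_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),  # assumed strategy
    ("classifier", RandomForestClassifier(
        random_state=1, n_estimators=250, min_samples_leaf=1,
        max_samples=0.5, max_features="sqrt")),
])

# Toy fit/predict to show usage; real code would fit on the full
# (oversampled) training data and predict on incoming sensor readings.
X = pd.DataFrame({"V1": [1.0, 2.0, np.nan, 4.0], "V2": [0.1, 0.2, 0.3, 0.4]})
y = pd.Series([0, 0, 1, 1])
model_pipeline.fit(X, y)
preds = model_pipeline.predict(X)
```

Bundling imputation and the classifier in one object means the same training-set statistics are applied at prediction time, and the whole pipeline can be serialized and deployed as a single artifact.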